This IPython notebook explains a basic workflow for matching two tables using py_entitymatching. Our goal is to come up with a workflow to match the DBLP and ACM datasets. Specifically, we want to achieve precision greater than 95% and recall greater than 90%. The datasets contain information about conference papers published in top database conferences.
First, we need to import the py_entitymatching package and other libraries as follows:
In [1]:
import sys
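# Optional: point Python at a local checkout of py_entitymatching
# (adjust or remove this line if the package is already installed)
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')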
import py_entitymatching as em
import pandas as pd
import os
In [2]:
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )
Matching two tables typically consists of the following three steps:
1. Reading the input tables
2. Blocking the input tables to get a candidate set
3. Matching the tuple pairs in the candidate set
We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.
In [3]:
# Get the paths
path_A = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'dblp_demo.csv'
path_B = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'acm_demo.csv'
In [4]:
# Load csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
In [5]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e., the Cartesian product): ' + str(len(A)*len(B)))
In [6]:
A.head(2)
Out[6]:
In [7]:
B.head(2)
Out[7]:
In [8]:
# Display the key attributes of table A and B.
em.get_key(A), em.get_key(B)
Out[8]:
Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching. py_entitymatching provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence applied to input tables.
For the matching problem at hand, we know that two conference papers published in different years cannot match. If there are errors in the year, then there should be at least some overlap between the paper titles. So we decide to apply the following blocking plan:
In [9]:
# Blocking plan
# A, B -- AttrEquivalence blocker [year] --|
#                                          |---> candidate set
# A, B -- Overlap blocker [title] ---------|
In [10]:
# Create attribute equivalence blocker
ab = em.AttrEquivalenceBlocker()
# Block on the 'paper year' attribute: pairs with the same year go into the candidate set
C1 = ab.block_tables(A, B, 'paper year', 'paper year',
                     l_output_attrs=['title', 'authors', 'paper year'],
                     r_output_attrs=['title', 'authors', 'paper year'])
len(C1)
Out[10]:
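To make the idea behind attribute-equivalence blocking concrete, here is a minimal pure-Python sketch. It is not how py_entitymatching implements the blocker; the helper `attr_equiv_block` and the toy records are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: not py_entitymatching's implementation.
# `attr_equiv_block` and the toy tables below are hypothetical.

def attr_equiv_block(left, right, l_attr, r_attr):
    """Return (left_id, right_id) pairs whose blocking attributes are equal."""
    # Index the right table by attribute value so we never scan all of A x B.
    index = {}
    for row in right:
        index.setdefault(row[r_attr], []).append(row['id'])
    pairs = []
    for row in left:
        for r_id in index.get(row[l_attr], []):
            pairs.append((row['id'], r_id))
    return pairs

toy_A = [
    {'id': 'a1', 'title': 'query optimization in streams', 'paper year': 2001},
    {'id': 'a2', 'title': 'indexing spatial data', 'paper year': 2003},
]
toy_B = [
    {'id': 'b1', 'title': 'query optimization in data streams', 'paper year': 2001},
    {'id': 'b2', 'title': 'spatial data indexing methods', 'paper year': 2002},
]

toy_C1 = attr_equiv_block(toy_A, toy_B, 'paper year', 'paper year')
print(toy_C1)  # [('a1', 'b1')]
```

Only the pair that shares a year survives, so a2 and b2 (which look like the same paper with a year discrepancy) would be lost if this were the only blocker.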
In [13]:
# Initialize overlap blocker
ob = em.OverlapBlocker()
# Block over title attribute
C2 = ob.block_tables(A, B, 'title', 'title', show_progress=False, overlap_size=2)
len(C2)
Out[13]:
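The following sketch shows the idea behind overlap blocking on toy data. We assume here that `overlap_size=2` means "keep a pair if the two titles share at least two word tokens"; the helper `overlap_block`, the whitespace tokenization, and the toy records are hypothetical simplifications, not the library's implementation.

```python
# Illustrative sketch only: not py_entitymatching's implementation.
# Assumption: a pair survives if it shares at least `overlap_size` word tokens.

def overlap_block(left, right, l_attr, r_attr, overlap_size=1):
    """Return (left_id, right_id) pairs sharing enough word tokens."""
    pairs = []
    for l_row in left:
        l_tokens = set(l_row[l_attr].lower().split())
        for r_row in right:
            r_tokens = set(r_row[r_attr].lower().split())
            if len(l_tokens & r_tokens) >= overlap_size:
                pairs.append((l_row['id'], r_row['id']))
    return pairs

toy_A = [
    {'id': 'a1', 'title': 'query optimization in streams'},
    {'id': 'a2', 'title': 'indexing spatial data'},
]
toy_B = [
    {'id': 'b1', 'title': 'query optimization in data streams'},
    {'id': 'b2', 'title': 'spatial data indexing methods'},
]

toy_C2 = overlap_block(toy_A, toy_B, 'title', 'title', overlap_size=2)
print(toy_C2)  # [('a1', 'b1'), ('a2', 'b2')]
```

The nested loop is quadratic and is only meant to show the filtering rule; a real blocker would use an index over tokens instead.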
In [14]:
# Combine the outputs from attr. equivalence blocker and overlap blocker
C = em.combine_blocker_outputs_via_union([C1, C2])
len(C)
Out[14]:
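Conceptually, the union keeps a tuple pair if either blocker kept it. A minimal sketch on toy candidate sets, represented as lists of id pairs, is shown below; `union_candidates` is a hypothetical helper, not part of the package.

```python
# Illustrative sketch only: candidate sets are modeled as lists of id pairs.

def union_candidates(*candidate_sets):
    """Merge candidate sets, dropping duplicate pairs and keeping first-seen order."""
    seen = set()
    combined = []
    for pairs in candidate_sets:
        for pair in pairs:
            if pair not in seen:
                seen.add(pair)
                combined.append(pair)
    return combined

toy_C1 = [('a1', 'b1')]                # pairs with the same year
toy_C2 = [('a1', 'b1'), ('a2', 'b2')]  # pairs with overlapping titles

toy_C = union_candidates(toy_C1, toy_C2)
print(toy_C)  # [('a1', 'b1'), ('a2', 'b2')]
```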
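In this step, we match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching. This typically involves the following four steps:
1. Sampling and labeling the candidate set
2. Splitting the labeled data into development and evaluation sets
3. Selecting the best learning-based matcher using the development set
4. Evaluating the selected matcher using the evaluation set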
First, we randomly sample 450 tuple pairs for labeling purposes.
In [15]:
# Sample candidate set
S = em.sample_table(C, 450)
Next, we label the sampled candidate set. Specifically, we enter 1 for a match and 0 for a non-match.
In [16]:
# Label S
#G = em.label_table(S, 'label')
For the purposes of this guide, we will load in a pre-labeled dataset (of 450 tuple pairs) included in this package.
In [17]:
# Load the pre-labeled data
path_G = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'labeled_data_demo.csv'
G = em.read_csv_metadata(path_G,
                         key='_id',
                         ltable=A, rtable=B,
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
len(G)
Out[17]:
In this step, we split the labeled data into two sets: development (I) and evaluation (J). Specifically, the development set is used to come up with the best learning-based matcher, and the evaluation set is used to evaluate the selected matcher on unseen data.
In [18]:
# Split G into development set (I) and evaluation set (J)
IJ = em.split_train_test(G, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']
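The idea of a seeded 70/30 split can be sketched in plain Python as follows. This is a simplified stand-in, not what `em.split_train_test` does internally; `toy_split` and the toy labeled pairs are hypothetical.

```python
# Illustrative sketch only: a seeded shuffle followed by a proportional cut.
import random

def toy_split(items, train_proportion=0.7, seed=0):
    """Shuffle a copy of `items` reproducibly and cut it into two disjoint parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_proportion)
    return shuffled[:n_train], shuffled[n_train:]

toy_labeled = [('a%d' % i, 'b%d' % i) for i in range(10)]
toy_dev, toy_eval = toy_split(toy_labeled)
print(len(toy_dev), len(toy_eval))  # 7 3
```

Fixing the seed plays the same role as `random_state=0` above: rerunning the cell yields the same development and evaluation sets.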
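Selecting the best learning-based matcher typically involves the following steps:
1. Creating a set of learning-based matchers
2. Creating features
3. Converting the development set into feature vectors
4. Selecting the best learning-based matcher using k-fold cross-validation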
In [19]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
Next, we need to create a set of features for the development set. py_entitymatching provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.
In [21]:
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
In [22]:
# List the names of the features generated
feature_table['feature_name']
Out[22]:
In [23]:
# Convert I into a set of feature vectors using the feature table
H = em.extract_feature_vecs(I,
                            feature_table=feature_table,
                            attrs_after='label',
                            show_progress=False)
In [24]:
# Display first few rows
H.head(3)
Out[24]:
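To give a feel for what a feature vector for one tuple pair might contain, here is a small sketch with two hand-written similarity features. The feature names, the whitespace tokenization, and the helpers `jaccard` and `exact_match` are assumptions made for illustration; the automatically generated features in the feature table may differ.

```python
# Illustrative sketch only: two hand-written features for a single tuple pair.

def jaccard(s1, s2):
    """Jaccard similarity over lowercased whitespace tokens."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 0.0  # convention chosen for this sketch to avoid dividing by zero
    return len(t1 & t2) / len(t1 | t2)

def exact_match(v1, v2):
    """1 if the two values are equal, else 0."""
    return 1 if v1 == v2 else 0

left = {'title': 'query optimization in streams', 'paper year': 2001}
right = {'title': 'query optimization in data streams', 'paper year': 2001}

toy_vector = {
    'title_jaccard': jaccard(left['title'], right['title']),
    'year_exact': exact_match(left['paper year'], right['paper year']),
}
print(toy_vector)  # {'title_jaccard': 0.8, 'year_exact': 1}
```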
Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the 'precision' and 'recall' metrics to select the best matcher.
In [27]:
# Select the best ML matcher using CV (metric: precision)
result = em.select_matcher([dt, rf, svm, ln, lg], table=H,
                           exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                           k=5,
                           target_attr='label', metric_to_select_matcher='precision', random_state=0)
result['cv_stats']
Out[27]:
In [29]:
# Select the best ML matcher using CV (metric: recall)
result = em.select_matcher([dt, rf, svm, ln, lg], table=H,
                           exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                           k=5,
                           target_attr='label', metric_to_select_matcher='recall', random_state=0)
result['cv_stats']
Out[29]:
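For readers unfamiliar with k-fold cross-validation, the sketch below shows how the rows of a table can be partitioned so that every row is used for testing exactly once. It is a simplification (no shuffling or stratification) and not the procedure `em.select_matcher` runs internally; `k_fold_indices` is a hypothetical helper.

```python
# Illustrative sketch only: contiguous, unshuffled folds over row indices.

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n rows."""
    # Spread the remainder over the first n % k folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

toy_folds = list(k_fold_indices(10, 5))
for train, test in toy_folds:
    print('train:', train, 'test:', test)
```

A matcher would be trained on each `train` part and scored on the matching `test` part, and the per-fold precision or recall averaged to compare matchers.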
We observe that the best matcher (RF) reaches the precision and recall that we expect (i.e., P > 95% and R > 90%). So, we select this matcher and proceed to evaluating it on unseen data (the evaluation set).
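Evaluating the matching outputs for the evaluation set typically involves the following four steps:
1. Converting the evaluation set to feature vectors
2. Training the matcher using the feature vectors extracted from the development set
3. Predicting the matches for the evaluation set using the trained matcher
4. Evaluating the predicted matches
As before, we convert the evaluation set into feature vectors (using the feature table).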
In [30]:
# Convert J into a set of feature vectors using the feature table
L = em.extract_feature_vecs(J, feature_table=feature_table,
                            attrs_after='label', show_progress=False)
Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide we use random forest as the selected matcher.
In [31]:
# Train the selected matcher (random forest) using feature vectors from I
rf.fit(table=H,
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
       target_attr='label')
Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).
In [33]:
# Predict on L
predictions = rf.predict(table=L, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                         append=True, target_attr='predicted', inplace=False)
Finally, we evaluate the accuracy of the predicted outputs.
In [34]:
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'label', 'predicted')
em.print_eval_summary(eval_result)
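The precision and recall targets stated at the start of this guide can be checked against definitions like the ones sketched below, computed from the gold labels and the predicted labels. This is a plain-Python illustration on made-up labels, not the code behind `em.eval_matches`; `precision_recall_f1` is a hypothetical helper.

```python
# Illustrative sketch only: precision, recall, and F1 from 0/1 label lists.

def precision_recall_f1(gold, predicted):
    """Compute (precision, recall, f1) for binary labels where 1 means match."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

toy_gold = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1]

toy_p, toy_r, toy_f1 = precision_recall_f1(toy_gold, toy_pred)
print(toy_p, toy_r, toy_f1)  # 0.75 0.75 0.75
```

Here three of the four predicted matches are correct (precision 0.75) and three of the four true matches are found (recall 0.75).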